Michael Smith, Alex Frye, Chris Boomhower ----- 1/29/2017

Describe the purpose of the data set you selected
The data set selected by our group for this lab primarily consists of Citi Bike trip history collected and released by NYC Bike Share, LLC and Jersey Bike Share, LLC under Citi Bike's NYCBS Data Use Policy. Citi Bike is America's largest bike share program, with 10,000 bikes and 600 stations across Manhattan, Brooklyn, Queens, and Jersey City... 55 neighborhoods in all. As such, our data set's trip history includes all rental transactions conducted within the NYC Citi Bike system from July 1st, 2013 to February 28th, 2014. These transactions amount to 5,562,293 trips within this time frame. The original data set includes 15 attributes, and our team was able to derive 10 more from the original 15, as discussed in detail in the next section. Of particular note, we merged NYC weather data from the Carbon Dioxide Information Analysis Center (CDIAC) with the Citi Bike data to provide environmental insight into rental behavior as well.
The trip data was collected via Citi Bike's check-in/check-out system among 330 of its stations in the NYC system as part of its transaction history log. While the non-publicized data likely includes further particulars such as rider payment details, the publicized data is anonymized to protect rider identity while simultaneously offering bike share transportation insights to urban developers, engineers, academics, statisticians, and other interested parties. The CDIAC data, however, was collected by the Department of Energy's Oak Ridge National Laboratory for research into global climate change. While basic weather conditions are recorded by CDIAC, as included in our fully merged data set, the organization also measures atmospheric carbon dioxide and other radiatively active gas levels to conduct their research efforts.
Our team has taken particular interest in this data set as some of our team members enjoy both recreational and commute cycling. By combining basic weather data with Citi Bike's trip data, we expect to be able to predict whether riders are more likely to be (or become) Citi Bike subscribers based on a trip's environmental conditions, the day of the week, the start and end locations, the general time of day (i.e., morning, midday, afternoon, evening, night), and the rider's age and gender. Deeper analysis may yield further insights, such as identifying gaps in station coverage. Furthermore, quantifiable predictions such as a rider's age as a function of trip distance and duration, given other factors, would improve targeting for bike share marketing efforts in New York City. Likewise, trip duration could be predicted from other attributes, which would allow the company to promote recreational cycling via factor adjustments within its control. By leveraging some of the vast number of trip observations as training data and others as test data via randomized selection, we expect to be able to measure the effectiveness of our algorithms and models throughout the semester.
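As a sketch of the randomized train/test selection described above (the 80/20 ratio and toy columns here are illustrative assumptions, not figures from our study):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the full trip data set
trips = pd.DataFrame({"tripduration": range(100),
                      "usertype": ["Subscriber"] * 100})

# Randomly assign roughly 80% of observations to training, the rest to test
rng = np.random.RandomState(42)
mask = rng.rand(len(trips)) < 0.8
train, test = trips[mask], trips[~mask]

# Every trip lands in exactly one partition
print(len(train), len(test))
```

Because the assignment is random, repeated draws with different seeds give slightly different split sizes, but the two partitions are always disjoint and together cover every observation.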
Describe the meaning and type of data
Before diving into each attribute in detail, one glaring facet of this data set that needs mentioning is its inherent time-series nature. By no means was this overlooked when we decided upon these particular data. To mitigate the effects of time on our analysis results, we have chosen to aggregate time-centric attributes such as dates and hours of the day by replacing them with simply the day of the week or period of the day (more on these details shortly). For example, by identifying trips occurring on July 1st, 2013, not by the date of occurrence but rather the day of the week, Monday, and identifying trips on July 2nd, 2013, as occurring on Tuesday, we will be able to obtain a "big picture" understanding of trends by day of the week instead of at the date-by-date level. We understand this is not a perfect solution, since the time-series component is still an underlying factor in trip activity, but it is good enough for the types of questions we hope to target as described in the previous section, since we will be comparing all Mondays against all Tuesdays, and so on.
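A minimal sketch of this aggregation idea, using two hypothetical start times from the first days of the data set:

```python
import pandas as pd

# Hypothetical start times from the first two days of the data set
starts = pd.Series(pd.to_datetime(["2013-07-01 08:15:00",
                                   "2013-07-02 17:40:00"]))

# Replace exact dates with day-of-week labels for "big picture" grouping
day_of_week = starts.dt.day_name()
print(day_of_week.tolist())  # ['Monday', 'Tuesday']
```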
As mentioned previously, the original data set from Citi Bike included 15 attributes. These 15 attributes and associated descriptions are provided below:
It is important to note that birth year and gender details are not available for "Customer" user types but rather for "Subscriber" riders only. Fortunately, these are the only missing data values among all trips in the data set. Unfortunately, however, it means that we will not be able to identify the ratio of males to females among non-subscribers or use age to predict Subscribers vs. non-subscribers (Customers). More to this end will be discussed in the next section.
It is also worth mentioning that while attributes such as trip duration, start and end stations, bike ID, and basic rider details were collected and shared with the general public, care was taken by Citi Bike to remove trips taken by staff during system service appointments and inspections, trips to or from "test" stations which were employed during the data set's timeframe, and trips lasting less than 60 seconds, which could indicate false checkouts or re-docking attempts during check-in.
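Citi Bike performed this cleaning upstream, but a rule like the 60-second cutoff could be reproduced on raw records along these lines (toy values; column name assumed to match the data set's tripduration field):

```python
import pandas as pd

# Toy records: one legitimate trip and one probable false checkout/re-dock
raw = pd.DataFrame({"tripduration": [754, 45]})  # seconds

# Keep only trips of at least 60 seconds, mirroring the published rule
clean = raw[raw["tripduration"] >= 60]
print(len(clean))  # 1
```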
Because some attributes may be deemed as duplicates (i.e. start_station_id, start_station_name, and start_station_latitude/longitude for identifying station locations), we chose to extract further attributes from the base attributes at hand. Further attributes were also extracted to mitigate the effects of time. In addition, we felt increased understanding could be obtained from combining weather data for the various trips as discussed in the previous section. These additional 10 attributes are described below:
After extracting our own attributes and merging weather data, the total number of attributes present in our final data set is 25. Only 15 are used throughout this lab, however, due to the duplicate nature of some attributes as discussed already. The final list of attributes used is tripduration, DayOfWeek, TimeOfDay, HolidayFlag, start_station_name, start_station_latitude, start_station_longitude, usertype, gender, Age, PRCP, SNOW, TAVE, TMAX, and TMIN.
To begin our analysis, we need to load the data from our source .csv files. Steps taken to pull data from the various source files are as follows:
Below you will see this process, as well as import/options for needed python modules throughout this analysis.
import os
import sys
import re
from geopy.distance import vincenty
import holidays
from datetime import datetime
from dateutil.parser import parse
import glob
import pandas as pd
import numpy as np
from IPython.display import display
import gmaps
import plotly as py
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
import statistics
from scipy.stats.stats import pearsonr
py.offline.init_notebook_mode()
pd.options.mode.chained_assignment = None
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
starttime = datetime.now()
print(starttime)
if os.path.isfile("Compiled Data/dataset1.csv"):
    print("Found the File!")
else:
    citiBikeDataDirectory = "Citi Bike Data"
    citiBikeDataFileNames = [
        "2013-07 - Citi Bike trip data - 1.csv",
        "2013-07 - Citi Bike trip data - 2.csv",
        "2013-08 - Citi Bike trip data - 1.csv",
        "2013-08 - Citi Bike trip data - 2.csv",
        "2013-09 - Citi Bike trip data - 1.csv",
        "2013-09 - Citi Bike trip data - 2.csv",
        "2013-10 - Citi Bike trip data - 1.csv",
        "2013-10 - Citi Bike trip data - 2.csv",
        "2013-11 - Citi Bike trip data - 1.csv",
        "2013-11 - Citi Bike trip data - 2.csv",
        "2013-12 - Citi Bike trip data.csv",
        "2014-01 - Citi Bike trip data.csv",
        "2014-02 - Citi Bike trip data.csv"
    ]
    weatherDataFile = "Weather Data/NY305801_9255_edited.txt"

    citiBikeDataRaw = []
    for file in citiBikeDataFileNames:
        print(file)
        filepath = citiBikeDataDirectory + "/" + file
        with open(filepath) as f:
            lines = f.read().splitlines()
        lines.pop(0)  # get rid of the first line that contains the column names
        for line in lines:
            line = line.replace('"', '')
            line = line.split(",")

            # Straight-line (Vincenty) distance between start and end stations, in miles
            sLatLong = (line[5], line[6])
            eLatLong = (line[9], line[10])
            distance = vincenty(sLatLong, eLatLong).miles
            line.extend([distance])

            startDateTime = parse(line[1])  # parse the start time once and reuse it

            # weekday() maps Monday = 0 through Sunday = 6
            dayNames = ["Monday", "Tuesday", "Wednesday", "Thursday",
                        "Friday", "Saturday", "Sunday"]
            DayOfWeek = dayNames[startDateTime.weekday()]
            line.extend([DayOfWeek])

            # Morning 5AM-10AM, Midday 10AM-2PM, Afternoon 2PM-5PM,
            # Evening 5PM-10PM, Night 10PM-5AM
            hour = startDateTime.hour
            if 5 <= hour < 10:
                TimeOfDay = 'Morning'
            elif 10 <= hour < 14:
                TimeOfDay = 'Midday'
            elif 14 <= hour < 17:
                TimeOfDay = 'Afternoon'
            elif 17 <= hour < 22:
                TimeOfDay = 'Evening'
            else:
                TimeOfDay = 'Night'
            line.extend([TimeOfDay])

            # HolidayFlag: 1 = Yes, 0 = No
            if startDateTime in holidays.UnitedStates():
                holidayFlag = "1"
            else:
                holidayFlag = "0"
            line.extend([holidayFlag])

            citiBikeDataRaw.append(line)
        del lines

    with open(weatherDataFile) as f:
        weatherDataRaw = f.read().splitlines()
    weatherDataRaw.pop(0)  # again, get rid of the column names
    for c in range(len(weatherDataRaw)):
        weatherDataRaw[c] = weatherDataRaw[c].split(",")
        # Adjust days and months to have a leading zero so we can capture all the data
        if len(weatherDataRaw[c][2]) < 2:
            weatherDataRaw[c][2] = "0" + weatherDataRaw[c][2]
        if len(weatherDataRaw[c][0]) < 2:
            weatherDataRaw[c][0] = "0" + weatherDataRaw[c][0]

    # Merge the daily weather observations onto each trip by start date
    citiBikeData = []
    while citiBikeDataRaw:
        instance = citiBikeDataRaw.pop()
        date = instance[1].split(" ")[0].split("-")  # uses the start date of the trip
        for record in weatherDataRaw:
            if (str(date[0]) == str(record[4]) and str(date[1]) == str(record[2])
                    and str(date[2]) == str(record[0])):
                instance.extend([record[5], record[6], record[7], record[8], record[9]])
                citiBikeData.append(instance)
    del citiBikeDataRaw
    del weatherDataRaw

    # Final Columns:
    # 0  tripduration
    # 1  starttime
    # 2  stoptime
    # 3  start station id
    # 4  start station name
    # 5  start station latitude
    # 6  start station longitude
    # 7  end station id
    # 8  end station name
    # 9  end station latitude
    # 10 end station longitude
    # 11 bikeid
    # 12 usertype
    # 13 birth year
    # 14 gender
    # 15 start/end station distance
    # 16 DayOfWeek
    # 17 TimeOfDay
    # 18 HolidayFlag
    # 19 PRCP
    # 20 SNOW
    # 21 TAVE
    # 22 TMAX
    # 23 TMIN

    # Write the compiled records back out in chunks of 250,000 lines per file
    maxLineCount = 250000
    lineCounter = 1
    fileCounter = 1
    outputDirectoryFilename = "Compiled Data/dataset"
    f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
    for line in citiBikeData:
        if lineCounter == maxLineCount:
            print(f)
            f.close()
            lineCounter = 1
            fileCounter = fileCounter + 1
            f = open(outputDirectoryFilename + str(fileCounter) + ".csv", "w")
        f.write(",".join(map(str, line)) + "\n")
        lineCounter = lineCounter + 1
    f.close()  # close the final partial file

    del citiBikeData

endtime = datetime.now()
print("RunTime: ")
print(endtime - starttime)
Now that we have compiled data files from both Citi Bike and the weather source, we want to load that data into a pandas dataframe for analysis. We iterate over and load each file produced above, then assign each column its appropriate data type. Additionally, we compute the Age column after producing a default value for missing "Birth Year" values. This is discussed further in the Data Quality section.
%%time
# Create CSV Reader Function and assign column headers
def reader(f, columns):
    d = pd.read_csv(f)
    d.columns = columns
    return d
# Identify All CSV FileNames needing to be loaded
path = r'Compiled Data'
all_files = glob.glob(os.path.join(path, "*.csv"))
# Define File Columns
columns = ["tripduration", "starttime", "stoptime", "start_station_id", "start_station_name", "start_station_latitude",
"start_station_longitude", "end_station_id", "end_station_name", "end_station_latitude",
"end_station_longitude", "bikeid", "usertype", "birth year", "gender", "LinearDistance", "DayOfWeek",
"TimeOfDay", "HolidayFlag", "PRCP", "SNOW", "TAVE", "TMAX", "TMIN"]
# Load Data
CitiBikeDataCompiled = pd.concat([reader(f, columns) for f in all_files])
# Replace '\N' Birth Years with Zero Values
CitiBikeDataCompiled["birth year"] = CitiBikeDataCompiled["birth year"].replace(r'\N','0')
# Convert Columns to Numerical Values
CitiBikeDataCompiled[['tripduration', 'birth year', 'LinearDistance','PRCP', 'SNOW', 'TAVE', 'TMAX', 'TMIN']]\
= CitiBikeDataCompiled[['tripduration', 'birth year','LinearDistance', 'PRCP', 'SNOW', 'TAVE', 'TMAX',
'TMIN']].apply(pd.to_numeric)
# Convert Columns to Date Values
CitiBikeDataCompiled[['starttime', 'stoptime']] \
= CitiBikeDataCompiled[['starttime', 'stoptime']].apply(pd.to_datetime)
# Compute Age: 0 Birth Year = 0 Age ELSE Compute Start Time Year Minus Birth Year
CitiBikeDataCompiled["Age"] = np.where(CitiBikeDataCompiled["birth year"]==0, 0,
CitiBikeDataCompiled["starttime"].dt.year - CitiBikeDataCompiled["birth year"])
# Convert Columns to Str Values
CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag', 'gender']] \
= CitiBikeDataCompiled[['start_station_id', 'end_station_id', 'bikeid', 'HolidayFlag','gender']].astype(str)
print(len(CitiBikeDataCompiled))
display(CitiBikeDataCompiled.head())
When analyzing our final dataset for accurate measures, there are a few key factors we can easily verify/research:
Computational Accuracy: Ensure data attributes added by computation are correct
Missing Data from Source
Although we are able to research these factors, one computation may still lack information in this dataset. Our LinearDistance attribute computes the distance from one lat/long coordinate to another. This attribute does not, however, tell us the 'true' distance a biker traveled before returning the bike. Some bikers may be biking for exercise around the city with various turns and loops, whereas others travel the quickest path to their destination. Because our dataset limits us to start and end locations, we do not have enough information to accurately compute distance traveled. For this reason, we have named the attribute "LinearDistance" rather than "DistanceTraveled".
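For reference, LinearDistance is a straight-line (great-circle) estimate; our code computes it with geopy's Vincenty implementation, but the same idea can be sketched with the haversine formula (the coordinates below are illustrative Manhattan points, not station data from the set):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/long points, in miles."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(a))  # mean Earth radius ~3,958.8 miles

# Two illustrative Manhattan coordinates, roughly 1.6 miles apart
d = haversine_miles(40.7527, -73.9772, 40.7418, -74.0048)
print(round(d, 2))
```

Whatever the formula, the result is a floor on distance traveled: any actual route between the two points is at least this long.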
Below we walk through the process of researching the 'measurable' data quality factors mentioned above:
To help mitigate challenges with time series data, we have chosen to break TimeOfDay into 5 categories. These Categories are broken down below:
To ensure that these breakdowns are accurately computed, we pulled the distinct list of TimeOfDay assignments by starttime hour. Looking at the results below, we can verify that this categorization is correctly being assigned.
# Compute StartHour from StartTime
CitiBikeDataCompiled["StartHour"] = CitiBikeDataCompiled["starttime"].dt.hour
# Compute Distinct Combinations of StartHour and TimeOfDay
DistinctTimeOfDayByHour = CitiBikeDataCompiled[["StartHour", "TimeOfDay"]].drop_duplicates().sort_values("StartHour")
# Print
display(DistinctTimeOfDayByHour)
#Clean up Variables
del CitiBikeDataCompiled["StartHour"]
In order to verify our computed DayOfWeek column, we have chosen one full week from 12/22/2013 - 12/28/2013 to validate. Below is a calendar image of this week to baseline our expected results:

To verify these 7 days, we pulled the distinct list of DayOfWeek assignments by StartDate (No Time). If we can verify one full week, we may justify that the computation is correct across the entire dataset. Looking at the results below, we can verify that this categorization is correctly being assigned.
# Create DataFrame for StartTime, DayOfWeek within Date Threshold
CitiBikeDayOfWeekTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
& (CitiBikeDataCompiled['starttime'].dt.month == 12)
& (CitiBikeDataCompiled['starttime'].dt.day >= 22)
& (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
["starttime", "DayOfWeek"]]
# Create FloorDate Variable as StartTime without the timestamp
CitiBikeDayOfWeekTest["StartFloorDate"] = CitiBikeDayOfWeekTest["starttime"].dt.strftime('%m/%d/%Y')
# Compute Distinct combinations
DistinctDayOfWeek = CitiBikeDayOfWeekTest[["StartFloorDate", "DayOfWeek"]].drop_duplicates().sort_values(
"StartFloorDate")
#Print
display(DistinctDayOfWeek)
# Clean up Variables
del CitiBikeDayOfWeekTest
del DistinctDayOfWeek
Using the same week as was used to verify DayOfWeek, we can test whether HolidayFlag is set correctly for the Christmas holiday. We pulled the distinct list of HolidayFlag assignments by start date (no time). If we can verify one holiday, we may justify that the computation is correct across the entire dataset. Looking at the results below, we expect to see HolidayFlag = 1 only for 12/25/2013.
# Create DataFrame for StartTime, HolidayFlag within Date Threshold
CitiBikeHolidayFlagTest = CitiBikeDataCompiled[(CitiBikeDataCompiled['starttime'].dt.year == 2013)
& (CitiBikeDataCompiled['starttime'].dt.month == 12)
& (CitiBikeDataCompiled['starttime'].dt.day >= 22)
& (CitiBikeDataCompiled['starttime'].dt.day <= 28)][
["starttime", "HolidayFlag"]]
# Create FloorDate Variable as StartTime without the timestamp
CitiBikeHolidayFlagTest["StartFloorDate"] = CitiBikeHolidayFlagTest["starttime"].dt.strftime('%m/%d/%Y')
# Compute Distinct combinations
DistinctHolidayFlag = CitiBikeHolidayFlagTest[["StartFloorDate", "HolidayFlag"]].drop_duplicates().sort_values(
"StartFloorDate")
#Print
display(DistinctHolidayFlag)
# Clean up Variables
del CitiBikeHolidayFlagTest
del DistinctHolidayFlag
Accounting for missing data is a crucial part of our analysis. At first glance, it is very apparent that we have a large amount of missing data in the Gender and Birth Year attributes from our source Citi Bike data. We already had to handle missing Birth Year values while computing "Age" in the Data Load from CSV section of this paper, assigning a default value of 0 so that future computations do not result in NA values. Gender missing values had already been given a default value of 0 in the source data. Although we have handled these missing values with defaults, we want to determine whether we need these records for further analysis or whether we may remove them from the dataset. Below you will see a table showing the frequency of missing values (or forced default values) by usertype. We noticed that of the 4,881,384 Subscriber trips in our dataset, only 295 were missing Gender information, whereas out of the 680,909 Customer (non-subscribing) trips, there was only one observation with complete information for both Gender and Birth Year. This quickly told us that removing records with missing values is NOT an option, since we would lose data for our entire Customer usertype. These attributes, as well as Age (computed from Birth Year), will prove difficult to use in a classification model attempting to predict usertype.
We also looked at all other attributes and verified that there are no additional missing values in our dataset. A missing value matrix was produced to identify whether there were any gaps in our data across all attributes. Because the results were conclusive, with no missing values present, we removed this uninformative visualization from the report.
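The per-column check behind that conclusion amounts to counting nulls, sketched here on a toy frame standing in for the compiled data set:

```python
import pandas as pd

# Toy frame standing in for the compiled data set
df = pd.DataFrame({"tripduration": [600, 720],
                   "gender": ["1", "0"],
                   "PRCP": [0.0, 0.12]})

# Null count per attribute; all zeros mirrors our conclusive result
missing_per_column = df.isnull().sum()
print(missing_per_column.sum())  # 0
```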
NADatatestData = CitiBikeDataCompiled[["usertype","gender", "birth year"]]
NADatatestData["GenderISNA"] = np.where(CitiBikeDataCompiled["gender"] == '0', 1, 0)
NADatatestData["BirthYearISNA"] = np.where(CitiBikeDataCompiled["birth year"] == 0, 1,0)
NAAggs = pd.DataFrame({'count' : NADatatestData.groupby(["usertype","GenderISNA", "BirthYearISNA"]).size()}).reset_index()
display(NAAggs)
del NAAggs
To ensure that there are no duplicate records in our dataset, we verified that the numbers of records before and after removing potential duplicates were equal. This test passed, so no alterations to the dataset were needed on account of duplicate records.
len(CitiBikeDataCompiled) == len(CitiBikeDataCompiled.drop_duplicates())
Trip Duration
In analyzing a box plot of trip duration values, we find extreme outliers present. With durations reaching up to 72 days in the most extreme instance, our team decided to rule out any observation with a duration greater than a 24-hour period. The likelihood of an individual sleeping overnight after their trip with the bike still checked out is much higher beyond the 24-hour mark. Such values easily skew the results, potentially hurting any analysis done. We move forward with removing a total of 457 observations based on trip durations greater than 24 hours (86,400 seconds).
%%time
%matplotlib inline
#CitiBikeDataCompiledBackup = CitiBikeDataCompiled
#CitiBikeDataCompiled = CitiBikeDataCompiledBackup
# BoxPlot tripDuration - Heavy Outliers!
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()
# How Many Greater than 24 hours?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"]>86400]))
# Remove > 24 Hours
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["tripduration"]<=86400]
Once the outliers are removed, we run the boxplot again and still see skewness in the results. To mitigate this right-skewed distribution, we take a log transform of the attribute.
%%time
%matplotlib inline
# BoxPlot Trip Duration AFTER removal of outliers
sns.boxplot(y = "tripduration", data = CitiBikeDataCompiled)
sns.despine()
# Log Transform Column Added
CitiBikeDataCompiled["tripdurationLog"] = CitiBikeDataCompiled["tripduration"].apply(np.log)
%%time
%matplotlib inline
# BoxPlot TripDurationLog
sns.boxplot(y = "tripdurationLog", data = CitiBikeDataCompiled)
sns.despine()
Age
Similarly, we look at the distribution of Age in our dataset. Interestingly, it seems we have several outlier observations logging birth years far enough back to cause ages to compute as high as 115 years old. Possible reasons for these outliers include data entry errors by those who prefer not to disclose personal information, or account sharing between a parent and a child, rendering an inaccurate data point for those actually taking the trip. Our target demographic for this study is individuals under 65 years of age, given that these are the age groups likely to be in better physical condition for the bike share service. Given this target demographic, and the poor entries causing extreme outliers, we have chosen to limit our dataset to observations up to 65 years of age. This change removed an additional 53,824 records from the dataset.
%%time
%matplotlib inline
# BoxPlot Age - Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
# How Many Greater than 65 years old?
print(len(CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]>65]))
# Remove > 65 years old
CitiBikeDataCompiled = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]<=65]
%%time
%matplotlib inline
# BoxPlot Age - removed Outliers!
sns.boxplot(y = "Age", data = CitiBikeDataCompiled[CitiBikeDataCompiled["Age"]!= 0])
sns.despine()
Given the extremely large volume of data collected, we decided to sample down to roughly 1/10th of the original dataset, for a total of 500,000 records. Before taking this action, however, we wanted to ensure that we keep data proportions reasonable for analysis and do not lose any important demographic in our data.
Below we compute the percentage of our dataset composed of Customers vs. Subscribers. We want our sample to be representative of the population dataset, so we stratify our sample to match the original data proportions.
%matplotlib inline
UserTypeDist = pd.DataFrame({'count' : CitiBikeDataCompiled.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)
UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')
Given these distribution percentages we are then able to compute the sample size for each usertype and then take a random sample within each group. Below you will see that our sampled distribution matches that of the original Dataset between Customers and Subscriber Usertypes.
SampleSize = 500000
CustomerSampleSize_Seed = int(round(SampleSize * 12.4 / 100.0,0))
SubscriberSampleSize_Seed = int(round(SampleSize * 87.6 / 100.0,0))
CitiBikeCustomerDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Customer'].sample(n=CustomerSampleSize_Seed, replace = False, random_state = CustomerSampleSize_Seed)
CitiBikeSubscriberDataSampled = CitiBikeDataCompiled[CitiBikeDataCompiled["usertype"] == 'Subscriber'].sample(n=SubscriberSampleSize_Seed, replace = False, random_state = SubscriberSampleSize_Seed)
CitiBikeDataSampled = pd.concat([CitiBikeCustomerDataSampled,CitiBikeSubscriberDataSampled])
print(len(CitiBikeDataSampled))
UserTypeDist = pd.DataFrame({'count' : CitiBikeDataSampled.groupby(["usertype"]).size()}).reset_index()
display(UserTypeDist)
UserTypeDist.plot.pie(y = 'count', labels = ['Customer', 'Subscriber'], autopct='%1.1f%%')
del CitiBikeDataCompiled
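As a design note, the same stratified draw can be expressed without hardcoding the 12.4%/87.6% percentages: pandas (1.1 and later) lets you sample a fraction within each group directly. A sketch on a toy frame with illustrative proportions:

```python
import pandas as pd

# Toy frame with the same two usertype strata (proportions illustrative)
df = pd.DataFrame({"usertype": ["Customer"] * 20 + ["Subscriber"] * 80,
                   "tripduration": range(100)})

# Draw 10% within each stratum, preserving the original proportions
sampled = df.groupby("usertype").sample(frac=0.1, random_state=1)
print(sampled["usertype"].value_counts().to_dict())
```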
With the massive data set randomly sampled down to 500,000 entries, we can more easily begin to explore the data available to us. First, we ran basic descriptive statistics to get a high-level, top-down view of the Citi Bike rentals we sampled.
With the first table, we looked at the categorical/non-numerical data available to us. Our sample contains the same number of unique stations as the full collected data set, giving us a good start toward assuming a representative sample.
CitiBikeDataSampled.describe(include=['O']).transpose()
Next we reviewed our numerical data, pulling the mean, standard deviation, and quartiles for each feature. Trip duration, as a reminder, is measured in seconds, with the minimum trip duration being 1 minute. Shorter trip durations were eliminated by Citi Bike in the source data to remove possible errors resulting from, among other explanations, people putting a bike back immediately after removing it.
Among all the numerical statistics, we are largely focused on Age, trip duration, LinearDistance, and the weather data. Birth Year was primarily used to calculate Age from the trip start date. The station latitudes and longitudes presented here are of little use in this form, but will prove valuable for constructing location heatmaps.
CitiBikeDataSampled.describe().transpose()
Before we begin visualizing some of the more specific variable interactions, it is important to see the magnitude of possible correlations, which we found using pandas' Pearson correlation function. We will ignore the obvious, such as birth year's correlation with Age and tripdurationLog's correlation with tripduration, as well as most correlations involving latitude and longitude.
Right away we noticed the large correlations between start station coordinates and end station coordinates. At first, we suspected this was because renters would start and end at the same station, but we ruled that out by counting all records where start_station_id equaled end_station_id and finding that they comprise less than 2.5% of all records. It is far more likely that we are seeing collinearity due to the close proximity of all the stations and the low variance between coordinates.
Other than that, no pair of features displayed strong correlation, except possibly tripduration and LinearDistance, which is only to be expected. That said, Age did show some indication of correlation with trip duration, if weak, as well as with the weather attributes. This possibly indicates that age plays a factor in determining whether or not someone rents a bike during certain weather, and how far or how long they travel.
CitiBikeDataSampled.corr()
CitiBikeDataSampled.query('start_station_id == end_station_id')["start_station_id"].count() / CitiBikeDataSampled["start_station_id"].count()
We also examined covariance among numerical attributes, finding relatively high covariance between tripduration and the known weather attributes as well as Age, further cementing the idea that these factors play into whether or not a person decides to rent and, if they do, how long they ride. Eventually we will explore at what point these decisions to ride ultimately culminate in the transition from Customer to Subscriber, but for now, let's examine age.
CitiBikeDataSampled.cov()
With age, we found a right skewed distribution (common with population age variables) with the majority of our renters falling below 35. As an additional note, nearly all the age data was provided by subscribers as most customers (non-subscribers) either did not input an age, or input an age that fell outside our expectations as described in data quality. All further analysis of age and its relationship with other attributes will be done through the lens of knowing that it only describes subscribers and not necessarily the entire population of Citi Bike Riders/Renters.
sns.distplot(CitiBikeDataSampled.query('Age != 0')["Age"])
Part of our initial exploration of the data, and of building out all the attributes we could, was attempting to see how far people were traveling during their rental period. Shy of making navigation calls to Google Maps or some other navigation service, we decided to calculate the linear distance between the start and end stations to approximate the distance traveled. We acknowledge, however, that riders weren't necessarily traveling from one station to another directly; quite a few rode for several hours and returned their bikes to stations only blocks apart. Likewise, a number of riders apparently traveled zero miles despite having trip durations in the minutes.
Below is a joint grid plotted out with distribution graphs on each axis for tripdurationLog and tripdistance. While it's not a perfect correlation, we do see that both are skewed in regards to higher values with a positive correlation. Further analysis will be necessary to draw any statistically significant conclusions, but this data combined with a study on average biking speed could be used to determine whether or not riders are simply using the bikes as transportation from one station to another, or as a means to travel outside the range of those stations.
dvd = sns.JointGrid(x="tripdurationLog", y="LinearDistance", data=CitiBikeDataSampled.query("LinearDistance > 0"))
dvd = dvd.plot(sns.regplot, sns.distplot)
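The "average biking speed" idea mentioned above can be approximated per trip as straight-line miles per hour; since LinearDistance ignores the actual route, this is a lower bound on true riding speed. A sketch with toy values (the column names match our data set; the numbers are illustrative):

```python
import pandas as pd

# Toy trips: LinearDistance in miles, tripduration in seconds
trips = pd.DataFrame({"LinearDistance": [1.5, 0.75],
                      "tripduration": [900, 900]})

# Straight-line speed in mph; actual routes are longer, so this is a floor
trips["ImpliedMPH"] = trips["LinearDistance"] / (trips["tripduration"] / 3600.0)
print(trips["ImpliedMPH"].tolist())  # [6.0, 3.0]
```

Comparing such figures against typical urban cycling speeds would help distinguish direct station-to-station transportation from longer recreational loops.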
To re-iterate, our main objectives in analyzing these data are to determine which attributes have greatest bearing on predicting a rider's type (Customer vs. Subscriber) and to gain a better understanding of rider behavior as a function of external factors. Many attributes in this data set will eventually be used in subsequent labs to answer these questions. The primary attributes on which we will focus our attention in this section, however, are as follows:
Over the course of this section, we will review these top attributes in some detail and discuss the value of using our chosen visualizations. Note also that merged weather data is of significant interest as well. As we desire to heavily compare weather conditions against various rider habits, however, we will refrain from focusing on weather-related attributes until the subsequent sections.
Before discussing the following heatmap in detail, it is worth noting some special steps required to use the gmaps module in Python, in case the reader is interested in rendering our code to plot data on top of Google's maps (full instructions are available at https://media.readthedocs.org/pdf/jupyter-gmaps/latest/jupyter-gmaps.pdf).
Besides having Jupyter Notebook installed on one's computer with extensions enabled (default if using Anaconda) and installing the gmaps module using pip, the following line should be run from within the command terminal. This is only to be done once and should be done when Jupyter Notebook is not running.
$ jupyter nbextension enable --py gmaps
In addition to running the above line in the command prompt, a standard Google API user key will need to be obtained from https://developers.google.com/maps/documentation/javascript/get-api-key. This only needs to be done once and is necessary to pull the Google map data into the Jupyter Notebook environment. The key is entered in the gmaps.configure() line as shown in the cell below. We have provided our own private key in the meantime for the reader's convenience.
Now on to the data visualization... This geographic heatmap visualization is interactive; however, the kernel must run the code block each time our Jupyter Notebook file is opened due to the API key requirement. Therefore, we've captured some interesting views to aid in our discussion and have included them as embedded images.
The start station heatmap represents the start station location data via attributes start_station_latitude and start_station_longitude. It identifies areas of highest and lowest concentration for trip starts. The location data is important as it helps us understand where the areas of highest activity are and, as will be seen in one of our later sections, will play an important role in identifying riders as regular customers or subscribers.
%%time
gmaps.configure(api_key="AIzaSyAsBi0MhgoQWfoGMSl5UcD-vR6H76cntxg") # Load private Google API key
locations = CitiBikeDataSampled[['start_station_latitude', 'start_station_longitude']].values.tolist()
m = gmaps.Map()
heatmap_layer = gmaps.Heatmap(data = locations)
m.add_layer(heatmap_layer)
An overall view quickly reveals that station data was only provided for areas of NYC south of Manhattan and mostly north of Brooklyn. This could mean either that the bike share program had not yet expanded into these other areas at the time of data collection or that the data simply wasn't included (as mentioned previously, many test sites were in use during this time frame, but Citi Bike did not include them with this data set).
Within the range of trip start frequency from the least number of trips (green) to the most trips (red), green and yellow indicate low to medium trip activity in most areas. However, higher pockets of concentration do exist in some places. We will attempt to put this visualization to good use by focusing in on one of these hotspots.
m

A prominent hotspot occurs just east of Bryant Park and the Midtown map label. Zooming into this area (via regular Google Maps controls, as the rendered visual is interactive) allows for a closer look. A snapshot of this zoomed-in image is embedded below. The hotspot seems slightly elongated and stands out from among the other stations. Zooming in further will help explain why this is and may shed some light on the higher activity in this area.

Zooming in to this area further helps us see that two stations are very close together. Even so, why might there be such high rider activity at these stations? This higher activity is likely affected by the stations' proximity to the famous Grand Central Station. As commuters and recreationalists alike arrive by train at Grand Central, it is natural that many of them may choose to continue their journey via the two closest bike share stations nearby. When the northernmost bike share station runs out of bikes, riders likely go to the next station to begin their ride instead.
By understanding the dynamics of geographical activity within this data set and the amenities that surround each station, we will be able to more efficiently leverage the data to make our classification and regression predictions.

Another attribute of interest is that of trip duration and days of the week. It can be expected that activity should vary depending on what day of the week riders are traveling. Not only are trip durations expected to vary, but with further analysis we expect the days of travel to have some influence on whether riders are bike share subscribers or not. To obtain a quick understanding of day-to-day variance in trip duration, a box plot is used.
The following interactive box plots are set up with the log of trip duration on the y-axis and the categorical days of the week along the x-axis. Once again, the log-transformed trip duration is used to help normalize the distribution for easier analysis. The plots reveal that trip durations do not vary much throughout the week; however, there is some increase on the weekends. Zooming in on the plot to focus mainly on the IQR regions helps put this increase in activity into perspective.
Though further analysis would be required, the data suggests that riders are spending more time riding on Saturdays and Sundays. Of greater interest will be Customer vs. Subscriber activity across each day of the week. This will be discussed further later.
td = CitiBikeDataSampled.tripdurationLog
sun = td.loc[CitiBikeDataSampled["DayOfWeek"] == 'Sunday']
mon = td.loc[CitiBikeDataSampled["DayOfWeek"] == 'Monday']
tue = td.loc[CitiBikeDataSampled["DayOfWeek"] == 'Tuesday']
wed = td.loc[CitiBikeDataSampled["DayOfWeek"] == 'Wednesday']
thu = td.loc[CitiBikeDataSampled["DayOfWeek"] == 'Thursday']
fri = td.loc[CitiBikeDataSampled["DayOfWeek"] == 'Friday']
sat = td.loc[CitiBikeDataSampled["DayOfWeek"] == 'Saturday']
sunday = go.Box(y=sun, name='Sunday')
monday = go.Box(y=mon, name='Monday')
tuesday = go.Box(y=tue, name='Tuesday')
wednesday = go.Box(y=wed, name='Wednesday')
thursday = go.Box(y=thu, name='Thursday')
friday = go.Box(y=fri, name='Friday')
saturday = go.Box(y=sat, name='Saturday')
layout = go.Layout(title='Log Trip Duration by Day of Week', xaxis=dict(title='DayOfWeek'), yaxis=dict(title='tripdurationLog (log sec)'))
data = [sunday, monday, tuesday, wednesday, thursday, friday, saturday]
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
In follow-up to the previous log trip duration box plots, it is also worth reviewing the linear distance traveled by riders for each day of the week. Linear distance also appears to remain constant throughout the week for all riders; unlike trip duration, there seems to be little change in distance even on the weekends. While distance remains constant throughout the week, grouping it by other categories later may help reveal more about rider activity in the data. This will be the case with user type, as seen shortly.
ld = CitiBikeDataSampled.LinearDistance
sun = ld.loc[CitiBikeDataSampled["DayOfWeek"] == 'Sunday']
mon = ld.loc[CitiBikeDataSampled["DayOfWeek"] == 'Monday']
tue = ld.loc[CitiBikeDataSampled["DayOfWeek"] == 'Tuesday']
wed = ld.loc[CitiBikeDataSampled["DayOfWeek"] == 'Wednesday']
thu = ld.loc[CitiBikeDataSampled["DayOfWeek"] == 'Thursday']
fri = ld.loc[CitiBikeDataSampled["DayOfWeek"] == 'Friday']
sat = ld.loc[CitiBikeDataSampled["DayOfWeek"] == 'Saturday']
sunday = go.Box(y=sun, name='Sunday')
monday = go.Box(y=mon, name='Monday')
tuesday = go.Box(y=tue, name='Tuesday')
wednesday = go.Box(y=wed, name='Wednesday')
thursday = go.Box(y=thu, name='Thursday')
friday = go.Box(y=fri, name='Friday')
saturday = go.Box(y=sat, name='Saturday')
layout = go.Layout(title='Linear Distance by Day of Week', xaxis=dict(title='DayOfWeek'), yaxis=dict(title='LinearDistance (miles)'))
data = [sunday, monday, tuesday, wednesday, thursday, friday, saturday]
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
As discussed previously, we do not have complete coverage of gender values in our data set. Nevertheless, we are interested to know whether there are significant differences between male and female bikers. Since the majority of customers (i.e. non-subscribers) do not provide Citi Bike with gender details, these insights apply mainly to subscribing members. We chose to first build a violin plot with the day of week on the x-axis and log trip duration on the y-axis, with two separate violins per day for male vs. female results. As discussed earlier, we see consistency in trip duration from day to day for both male and female bikers. Interestingly, females in general have ridden for longer durations than males. One could speculate that this is due to additional excursions (e.g. shopping, events, etc.) or slower biking speeds during the trip.
sns.set(style="whitegrid", palette="pastel", color_codes=True)
# Load our subset data set
sub = CitiBikeDataSampled.query('gender != "0"')
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="tripdurationLog", hue="gender", data=sub, split=False,
               inner="quart", palette={"1": "b", "2": "pink"}, linewidth=0.5,
               order=["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
sns.despine(left=True)
To complement the findings above (log trip duration by day of week and gender) and attempt to gain additional insight into why females have longer trip durations, we built a second violin plot, this time with day of week on the x-axis and LinearDistance on the y-axis. This yielded much less insightful information, as linear distance is not indicative of the actual trip distance. The plotted linear distances appear fairly consistent both from day to day (as discussed previously) and between males and females. Further research and/or more insightful data points, such as actual trip distance or average trip speed, could yield further insights. These additional data points will be discussed later in the paper.
sns.violinplot(x="DayOfWeek", y="LinearDistance", hue="gender", data=sub, split=False,
               inner="quart", palette={"1": "b", "2": "pink"}, linewidth=0.5,
               order=["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"])
sns.despine(left=True)
Another attribute that may potentially reveal much about Customer vs. Subscriber activity is time of day. Again, the time of day attribute is a categorical variable with a group assigned to a ride depending on which range of hours throughout the day the bike ride begins. While trip durations are expected to change by time of day, it would likely be misleading to lump times of day regardless of days of the week since rider activity is expected to change based on work schedule or weekend activities.
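The TimeOfDay label is assigned from the hour in which a trip starts. A hedged sketch of that binning (the Afternoon slot covers 2-5 PM as noted later in this section; the remaining hour boundaries shown here are illustrative, not the notebook's exact definition):

```python
import pandas as pd

# Illustrative cut points -- Afternoon = 2-5 PM per our groupings; the other
# boundaries here are assumptions: Morning = 5-10, Midday = 10-14,
# Evening = 17-21, and Night wraps past midnight.
def time_of_day(start_hour):
    if 5 <= start_hour < 10:
        return "Morning"
    if 10 <= start_hour < 14:
        return "Midday"
    if 14 <= start_hour < 17:
        return "Afternoon"
    if 17 <= start_hour < 21:
        return "Evening"
    return "Night"

start_hours = pd.Series([7, 12, 15, 18, 23, 2])
labels = start_hours.apply(time_of_day)
```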
The interactive stacked bar plot below is better suited to raw trip duration data than to log-transformed data, as the raw data accentuates the true differences from day to day and across time slots. Because the raw data is strongly right-skewed, the median value from each day-time grouping was obtained and rendered in the plot. By stacking the median trip durations for each time slot within each respective day, we are able to better understand the time slot proportionality differences across each day of the week. Stacking the time slot medians also provides a total sum for each day, making it easier to compare overall activity from day to day.
Correlating with the "Box Plot for Log Trip Duration by Day of the Week" visualization above, trip duration activity does increase on the weekends. Not only that, but hovering over each day's bar reveals that Midday, Afternoon, and Evening activity is noticeably higher on Saturdays and Sundays than on weekdays, whereas Morning and Night activity is relatively consistent. Understanding these trends may help us understand rider intent throughout the week and how it affects the decision to subscribe to Citi Bike. Being cognizant of these trends, alongside start and end location frequency, may also help improve inventory at some stations at peak times of day.
td = CitiBikeDataSampled
days = ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
timeSlots = ['Morning', 'Midday', 'Afternoon', 'Evening', 'Night']
# Compute the median raw trip duration for one time slot across each day of the week
def medianByDay(timeSlot):
    return [statistics.median(td.tripduration.loc[(td["DayOfWeek"] == day) &
                                                  (td["TimeOfDay"] == timeSlot)])
            for day in days]
# Define one stacked bar series per time slot
data = [go.Bar(x=days, y=medianByDay(slot), name=slot) for slot in timeSlots]
# Combine features and render the stacked bar plot
layout = go.Layout(barmode='stack', title='Median Trip Duration by Day of Week and Time of Day',
                   xaxis=dict(title='DayOfWeek'), yaxis=dict(title='tripduration (sec)'))
fig = go.Figure(data=data, layout=layout)
py.offline.iplot(fig)
Our team was interested in the correlation between the log of trip duration and age. To examine this, we produced a joint density correlation plot between tripdurationLog and Age. Once again, due to missing birth year data for our customer (non-subscribing) users, these insights are slightly skewed in terms of interpreting the Pearson's r value produced alongside the plot. Recomputing only on those observations that do not contain an age of zero yields a Pearson's r of 0.048. This is a very low positive correlation, confirming what is seen visually in the joint plot: trip duration changes very little as rider age increases. What was interesting to see was the difference between the Age = 0 density rings and the rest of the data set. As discussed previously, Age = 0 records mainly consist of customer (non-subscribing) users, so we can see that the majority of trip durations for customers are greater than those of subscribing users. This is discussed in more detail later in the paper, but one could infer that this is possibly due to subscriber usage via routine trips vs. customer usage via events, trails, etc. Finally, the core density ring areas show that the majority of trips by subscribing members are taken by individuals between roughly 25 and 35 years old, with durations around 518 seconds (e^6.25). This matched our suspected core member age demographic, as these are likely working individuals in good physical condition for consistent riding.
cont = sns.jointplot(x=CitiBikeDataSampled.Age, y=CitiBikeDataSampled.tripdurationLog, kind='kde', color='r')
pearsondata = CitiBikeDataSampled[CitiBikeDataSampled["Age"]!= 0][["Age", "tripdurationLog"]]
print(pearsonr(pearsondata.Age,pearsondata.tripdurationLog))
del pearsondata
Although we have previously analyzed plots for both day-of-week and time-of-day trip durations, we were interested in further exploring, in a different light, the median trip duration values within specific day and/or time-of-day groupings. To do this, we produced a heatmap of median raw trip duration values (the median within each [Day of Week, Time of Day] grouping), with time of day (ordered Morning to Night) on the x-axis and day of week (ordered Sunday to Saturday) on the y-axis. Right off the bat, we see that the grouping pair with the largest median trip duration is Saturday afternoons (2-5 PM). We can also see that weekend trips are generally longer than weekday trips, with emphasis on Midday through Evening start times. Interestingly, trip durations are consistently higher in the evenings, probably due to travelers using the service after work hours.
Most of these results were what the team expected and had hoped to see. In general, using median times within [Day of Week, Time of Day] groupings, we observed that evenings and weekends received the largest trip durations, whereas weekday mornings received the lowest. This information could be useful to Citi Bike, showing which times of day and days of week need promotions, events, etc. to increase traffic flow. Knowing that trips are generally longer during the weekend could also explain bike availability concerns. Bike availability is a huge part of this bike share service and has a large impact on traveler satisfaction when bikes aren't available at the closest station. Combining this information with the geocoordinate density maps discussed in this paper could help support bike station shift services, moving bikes from less-traveled areas to more heavily traveled areas. This could potentially help mitigate some availability loss in their service.
sns.set()
grouped = CitiBikeDataSampled.groupby(['DayOfWeek', 'TimeOfDay'], as_index=False)
groupAgg = grouped.aggregate(np.median)
groupAgg['DayOfWeek'] = pd.Categorical(groupAgg['DayOfWeek'], ['Sunday',
'Monday',
'Tuesday',
'Wednesday',
'Thursday',
'Friday',
'Saturday'])
groupAgg['TimeOfDay'] = pd.Categorical(groupAgg['TimeOfDay'], ['Morning',
'Midday',
'Afternoon',
'Evening',
'Night'])
groupAgg = groupAgg.sort_values(by=['DayOfWeek','TimeOfDay'])
dist0 = groupAgg[["DayOfWeek", "TimeOfDay", "tripduration"]]
dist1 = dist0.pivot("DayOfWeek", "TimeOfDay", "tripduration")
# Render DayOfWeek vs. TimeOfDay heatmap of median trip duration
sns.heatmap(dist1, annot=True, fmt="f", linewidths=0.01)
We were interested in seeing the effect temperature has on trip duration. To examine this, we created a joint density correlation plot between average temperature (TAVE) and the log of trip duration (tripdurationLog). Interestingly, we did not see as high a correlation as expected. With a Pearson's r value of 0.15, there is only a small positive correlation between trip duration (as depicted by the log transformation) and average temperature. The team expected much more correlation, as we thought individuals would not enjoy being on a bike during cold weather. We do, however, see that the density of trip durations is highest around 75 degrees Fahrenheit. This matched our expectations because, although durations remained fairly unchanged, the number of trips taken decreased as average temperature decreased. This is also depicted by the skewed distribution shown on the y-axis (right side of plot). Possibly, the reason we did not see a larger change in trip duration is the nature of subscriber usage of the bike share service. It is possible that subscribers use the service for routine travel around the city: grocery trips, work trips, trips to meet friends, etc. If the bike share service is a core means of transportation, these trip distances do not change with cold weather; the number of riders decreases while durations stay mostly consistent. Further research on this correlation between subscribers and customers, and potentially some surveys of subscribing members, could help test this theory. If these insights hold during cold weather months, they could inform marketing promotions targeting working individuals in an attempt to increase bike share traffic for routine trips.
%%time
cont = sns.jointplot(x=CitiBikeDataSampled.tripdurationLog, y=CitiBikeDataSampled.TAVE, kind='kde')
#cont.plot_joint(plt.scatter, c="w", s=3, linewidth=0.5, marker=".")
At the core of our business understanding is wanting to identify the point at which a rider moves from being a customer to a subscriber. Being able to know what qualities or features of the two separate the roles and to what degree would allow us to identify stations and areas that hold the most potential for subscriber enrollment.
Almost universally, across every day of the week, customers appear to have a higher trip duration than subscribers. While additional analysis will be required to confirm this, it's possible that one explanation is that subscribers can freely take and return their bikes which means that they're more willing to make shorter trips versus customers that pay each time they want to rent a bike in the first place. An alternate explanation, based on what we know in regards to the relationship between trip duration and linear distance traveled, is that subscribers are using the bikes for commuting to and from specific locations. This would result in a lower trip duration than customers that might use their bikes for general travel around the city. This possibility is corroborated by the decrease in activity on the weekends by subscribers.
Identifying the point at which a customer might become a subscriber using this data would probably include monitoring weekday activity and trip duration. If a station has a lot of customers with trip durations similar to those of subscribers, then that station would be a good location to do a focused advertisement of the benefits of subscribing.
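A minimal sketch of that screening idea on a toy frame (column names mirror our data set; the benchmark of the overall subscriber median and the tolerance of 0.5 log-seconds are arbitrary illustrations, not a validated rule):

```python
import pandas as pd

# Toy trips mirroring our columns: start station, user type, log trip duration
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B", "C", "C"],
    "usertype": ["Customer", "Customer", "Subscriber",
                 "Customer", "Subscriber", "Customer", "Subscriber"],
    "tripdurationLog": [6.2, 6.4, 6.3, 8.0, 6.1, 6.25, 6.3],
})

# Benchmark: overall subscriber median log trip duration
sub_median = trips.loc[trips.usertype == "Subscriber", "tripdurationLog"].median()

# Per-station customer medians, flagged when within +/-0.5 of the benchmark
cust = trips[trips.usertype == "Customer"]
cust_medians = cust.groupby("start_station_name")["tripdurationLog"].median()
targets = cust_medians[(cust_medians - sub_median).abs() <= 0.5].index.tolist()
```

Stations surfaced this way would be candidates for focused advertisement of subscription benefits.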
sns.set(style="whitegrid", palette="pastel", color_codes=True)
# Load our subset data set
sub = CitiBikeDataSampled
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="tripdurationLog", hue="usertype", data=sub, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
Unlike trip duration, the linear distance between start and end stations appears similar for customers and subscribers in terms of both means and quartiles. What is noticeable here, however, is that customers are more widely distributed in how far or near they ride, with a significant increase in the number of customers who return their bikes to the station they started from.
Further analysis will be necessary to explore the statistical significance of these differences, but it would be possible to identify stations frequented by subscribers, treat stations within one standard deviation of the linear distance shown below as "subscriber stations," and then see which stations fall outside those zones to further build up messaging encouraging subscription. Furthermore, by identifying those "hot zones," it would be possible to rotate out bikes to increase their longevity.
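A hedged sketch of that one-standard-deviation screen on a toy frame (column names mirror our data set; the decision to compare per-station median distances against the subscriber mean is our own illustrative choice):

```python
import pandas as pd

# Toy subscriber trips with linear distances in miles
subs = pd.DataFrame({
    "start_station_name": ["A", "A", "B", "B", "C"],
    "LinearDistance": [1.0, 1.2, 1.1, 0.9, 4.8],
})

# Mean and (sample) standard deviation of subscriber linear distances
mu = subs["LinearDistance"].mean()
sigma = subs["LinearDistance"].std()

# Stations whose median subscriber distance falls within one std of the mean
station_medians = subs.groupby("start_station_name")["LinearDistance"].median()
in_zone = station_medians[(station_medians - mu).abs() <= sigma].index.tolist()
```

Stations outside `in_zone` would be the ones to target with additional subscription messaging.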
sns.set(style="whitegrid", palette="pastel", color_codes=True)
# Load our subset data set
sub = CitiBikeDataSampled
# Draw a nested violinplot and split the violins for easier comparison
sns.violinplot(x="DayOfWeek", y="LinearDistance", hue="usertype", data=sub, split=True,
               inner="quart", palette={"Subscriber": "g", "Customer": "y"})
sns.despine(left=True)
After visualizing the overall data set's locations with a heatmap over NYC, we decided to take the visualization one step further. This time, we broke the data set into two segments: Customer vs. Subscriber. Below are two separate gmaps heatmaps containing geographic densities for each usertype. What we found supported our theories on customer vs. subscriber usage tendencies. Seen first, the Customer heatmap contains far fewer dense regions overall. This helps confirm our suspicion that Customer bikers are less "routine" than subscribing bikers. When looking for the densest region in this heatmap, one point stood out as particularly interesting: the zoo. When comparing this region on the Subscriber heatmap, we did not see the same type of traffic! This supports our theory that customer bikers use the service more for events, shopping, or one-time convenience. On the Subscriber heatmap, the densest region is near Grand Central Station, as discussed earlier, supporting the complementary theory that subscribing members take routine trips to work, groceries, etc., consistently using the bike share service as a means to reach the metro station.
Customer Users
customerData = CitiBikeDataSampled.query('usertype == "Customer"')
customerLoc = customerData[['start_station_latitude', 'start_station_longitude']].values.tolist()
cmap = gmaps.Map()
customer_layer = gmaps.Heatmap(data=customerLoc)#, fill_color="red", stroke_color="red", scale=3)
cmap.add_layer(customer_layer)
cmap

Subscriber Users
subscriberData = CitiBikeDataSampled.query('usertype == "Subscriber"')
subscriberLoc = subscriberData[['start_station_latitude', 'start_station_longitude']].values.tolist()
smap = gmaps.Map()
subscriber_layer = gmaps.Heatmap(data=subscriberLoc)#, fill_color="green", stroke_color="green", scale=2)
smap.add_layer(subscriber_layer)
smap

Because we were able to bring together historical weather data for the dates in our records, we wanted to explore the relationship these variables have with usertype status. If subscribers are regularly using the bikes for commuting, as we've begun to see, then weather wouldn't impact their rental statistics as much as it would customers, who appear to be primarily opportunistic in their usage.
A quick cursory glance reveals a noticeable difference in bike rentals with respect to low temperatures, precipitation, and snowfall. While it is true that there are fewer customers than subscribers, we are concerned primarily with the spread or distribution of the plot points rather than the quantity. The customer pair plots show fewer points distributed across the lower temperature ranges and higher precipitation/snowfall ranges. The distributions pick back up at higher temperatures and lower precipitation points for both usertypes.
If stations consistently see use during "bad" weather, then those stations could be identified as subscriber stations. Further, if certain customers are found making the same trips consistently in all weather types, then they could be pushed for subscription.
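A first pass at that identification could rank stations by the share of their trips taken in measurable precipitation; sketched on a toy frame (PRCP mirrors our merged weather column; the greater-than-zero threshold is an illustrative choice):

```python
import pandas as pd

# Toy trips with merged weather data: PRCP is daily precipitation in inches
trips = pd.DataFrame({
    "start_station_name": ["A", "A", "A", "B", "B"],
    "PRCP": [0.0, 0.3, 0.5, 0.0, 0.0],
})

# Share of each station's trips taken on days with measurable precipitation
flagged = trips.assign(bad_weather=trips["PRCP"] > 0)
bad_share = flagged.groupby("start_station_name")["bad_weather"].mean()
```

Stations with a consistently high bad-weather share would be candidates for the "subscriber station" label described above.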
sns.pairplot(CitiBikeDataSampled.query("usertype == 'Subscriber'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
sns.pairplot(CitiBikeDataSampled.query("usertype == 'Customer'"), x_vars=["PRCP","SNOW","TAVE","TMAX","TMIN"], y_vars=["tripduration","tripdurationLog","LinearDistance"])
Beyond the features we already chose to add to the original data set, there are others of particular interest that would bring much value to the existing data as well. We've documented some of our ideas below:
Our team spent a substantial amount of time completing this assignment (50+ total man hours). Some of the features we would consider exceptional work include: